feat: Properly cache object-stores #14598
Conversation
Will give it a try tomorrow.
Codecov Report

Additional details and impacted files

@@           Coverage Diff            @@
##             main   #14598    +/-   ##
=========================================
  Coverage   77.95%   77.95%
=========================================
  Files        1326     1326
  Lines      173054   173061      +7
  Branches     2448     2448
=========================================
+ Hits       134896   134904      +8
+ Misses      37714    37713      -1
  Partials      444      444

☔ View full report in Codecov by Sentry.
@ritchie46 I might not 100% get how long-lived the object store is, but can't using the URL as the key leak credentials between queries? If one call is made with one set of credentials and another is made with a second set, the initial set of credentials will be used since they match on URL?
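A minimal sketch of the concern being raised (the names are hypothetical, not the actual polars cache): with a URL-only key, the second query's credentials never reach the cached client.

```rust
use std::collections::HashMap;

// Hypothetical sketch, not polars internals: if the map is keyed by URL
// alone, fresh credentials passed for an already-cached URL are ignored and
// the first query's store (and its credentials) is reused.
struct ObjectStoreCache {
    stores: HashMap<String, CachedStore>, // key: cloud URL only
}

struct CachedStore {
    credentials: String, // stand-in for the real credential object
}

impl ObjectStoreCache {
    fn get_or_insert(&mut self, url: &str, credentials: &str) -> &CachedStore {
        // `or_insert_with` only runs on a miss, so a second caller with
        // different credentials still gets the original store back.
        self.stores
            .entry(url.to_string())
            .or_insert_with(|| CachedStore {
                credentials: credentials.to_string(),
            })
    }
}
```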
@ritchie46 still encountered the issue (though it seems like I'm running into it at a lower rate?).
Ok. Will wrap them with a LimitStore as well.
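For context, a sketch of what wrapping a store in the `object_store` crate's `LimitStore` looks like; the bucket name and request cap are placeholder values, not what this PR uses.

```rust
use std::sync::Arc;

use object_store::aws::AmazonS3Builder;
use object_store::limit::LimitStore;
use object_store::ObjectStore;

// Sketch only: LimitStore caps the number of concurrent requests to the
// inner store at `max_requests`. Bucket name and the cap are placeholders.
fn build_limited_store() -> object_store::Result<Arc<dyn ObjectStore>> {
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket")
        .build()?;
    let limited = LimitStore::new(s3, 15);
    Ok(Arc::new(limited))
}
```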
@ritchie46 when I say lower rate, I mean it happens maybe 10x-20x less frequently now. Seems like this might've largely fixed the issue. I think you can likely use a higher concurrency budget by default now? The only way I can get it to trigger now is by using a much higher than default concurrency budget (which does seem to provide a fairly significant increase in network throughput for me).
Right. So on the default settings it doesn't happen? That's great to hear, as I could not really understand how a … What is your network throughput per concurrency budget?
Next week I will be able to make a setup to optimize this case. In the meantime I am pleased to hear we can actually achieve 1.4 GB/s. Doesn't sound bad. Maybe we should set the default budget to 3x vCPU; currently it is 15, I believe. For poor internet connections we must be conservative. Ideally I'd make this dynamic and increase the budget as long as bandwidth improves.
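A hypothetical sketch of that "grow the budget while bandwidth keeps improving" idea, with invented thresholds and not based on any polars code:

```rust
// Hypothetical sketch: grow the concurrency budget step by step and back
// off as soon as measured throughput stops improving. Thresholds invented.
struct AdaptiveBudget {
    budget: usize,
    last_throughput: f64, // bytes per second from the previous window
}

impl AdaptiveBudget {
    fn new(initial: usize) -> Self {
        Self { budget: initial, last_throughput: 0.0 }
    }

    /// Called after each measurement window with the observed throughput.
    fn update(&mut self, throughput: f64) -> usize {
        if throughput > self.last_throughput * 1.05 {
            // Still improving: allow a few more concurrent downloads.
            self.budget += 2;
        } else if throughput < self.last_throughput * 0.9 {
            // Got worse: back off, but keep a conservative minimum.
            self.budget = self.budget.saturating_sub(2).max(4);
        }
        self.last_throughput = throughput;
        self.budget
    }
}
```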
polars/crates/polars-io/src/pl_async.rs, line 23 (at b205950)
Seems like it's set to the max of your num threads (so for me it was probably set to 96 by default) and 10? Yeah, I think fully decoupling the downloading and the Parquet parsing/reading could probably greatly improve utilization and speed. I'm guessing that because there are only 5 row groups per file and only 50k rows per row group (and I'm only reading one column), we're probably parallelizing at too fine-grained a level, since it seems like parallelization occurs at a per-file level too?
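A rough sketch of the kind of default described here (max of available threads and a floor, with an environment-variable override); this is not the exact pl_async.rs code, and the variable name and floor are assumptions.

```rust
use std::thread::available_parallelism;

// Sketch, not the actual polars implementation: default the budget to the
// number of available threads but never below a fixed floor, and let an
// environment variable (name assumed) override it.
fn default_concurrency_budget() -> usize {
    const FLOOR: usize = 10; // hypothetical floor
    let threads = available_parallelism().map(|n| n.get()).unwrap_or(FLOOR);
    std::env::var("POLARS_CONCURRENCY_BUDGET") // env var name assumed
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or_else(|| threads.max(FLOOR))
}
```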
Yes, but the challenge is that we don't know that. It might be that you're reading one file with many row groups. For the rayon decoding it shouldn't really matter if tasks are small. We also have the challenge of stopping early at a limit, so we cannot download everything at once; earlier files should have preference. Another challenge is that budgets on AWS can be much higher than on local machines. I first want to see if we can reduce granularity on downloading if we have info about the download fingerprint, e.g. no. of files, row groups, projection, etc.
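A hypothetical sketch of that "download fingerprint" idea; the struct, thresholds, and decision rule are invented for illustration only.

```rust
// Invented sketch: use what is known about the scan (file count, row groups
// per file, projected columns) to pick a coarser or finer download unit.
struct ScanFingerprint {
    n_files: usize,
    row_groups_per_file: usize,
    projected_columns: usize,
}

enum DownloadUnit {
    WholeFile,
    RowGroup,
}

fn choose_download_unit(fp: &ScanFingerprint) -> DownloadUnit {
    // Many small files with few row groups and a narrow projection: fetching
    // whole files avoids flooding the budget with tiny per-row-group tasks.
    if fp.n_files > 32 && fp.row_groups_per_file <= 8 && fp.projected_columns <= 4 {
        DownloadUnit::WholeFile
    } else {
        DownloadUnit::RowGroup
    }
}
```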
Hi there. Is there a way to manually trigger (from Python) a clear-out of the object store cache? The way things are set up in my company, AWS S3 credentials are only valid for some set period of time (30-40 minutes). We have a process for periodically acquiring new credentials, but when we pass them in with the storage_options parameter to scan_parquet, we end up crashing with an S3 error… I'm guessing because of this cache. So one solution would be to provide the ability to manually clear the cache (if not already available). Another could be to make the cache key based on some hash of everything in the storage_options parameter.
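A minimal sketch of the two suggestions in that comment (hypothetical names, not the polars API): key the cache on the URL plus a hash of every storage option, and expose a `clear` hook for credential rotation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical sketch: rotated credentials change the options hash, so a
// new store is built instead of reusing one with expired credentials.
struct ObjectStoreCache {
    stores: HashMap<(String, u64), CachedStore>,
}

struct CachedStore; // stand-in for the real cached client

fn options_hash(storage_options: &[(String, String)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (k, v) in storage_options {
        k.hash(&mut hasher);
        v.hash(&mut hasher);
    }
    hasher.finish()
}

impl ObjectStoreCache {
    fn get_or_build(&mut self, url: &str, storage_options: &[(String, String)]) -> &CachedStore {
        let key = (url.to_string(), options_hash(storage_options));
        self.stores.entry(key).or_insert_with(|| CachedStore)
    }

    // Manual eviction hook that a Python-facing API could call.
    fn clear(&mut self) {
        self.stores.clear();
    }
}
```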
Might help #14384
Closes #14572
@kszlim can you check?